What is Seaborn?¶

Seaborn is a Python data visualization library built on top of Matplotlib. It helps us understand data by displaying it in a visual context, revealing correlations between variables or trends that might not be obvious at first. Seaborn offers a high-level interface, whereas Matplotlib works at a lower level.

Why should you use Seaborn versus matplotlib?¶

Seaborn makes our charts and plots look engaging and covers some common data visualization needs (like mapping color to a variable). In short, it makes data visualization and exploration easier.

Seaborn addresses two big limitations of matplotlib:

  1. Seaborn comes with a large number of high-level interfaces and customized themes that matplotlib lacks; with matplotlib alone it is not easy to figure out the settings that make plots attractive

  2. Matplotlib functions don’t work well with dataframes, whereas seaborn's do
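As a minimal sketch of both points, the snippet below plots straight from a DataFrame and maps color to a variable with a single parameter. The `toy` frame and its column names are hypothetical and are not part of the HR dataset used later.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data (not the HR dataset used later in this notebook).
toy = pd.DataFrame({
    "age": [25, 32, 41, 29, 50, 38],
    "income": [3000, 4200, 6100, 3500, 9000, 5200],
    "team": ["A", "B", "A", "B", "A", "B"],
})

# Columns are referenced by name; hue maps color to a variable
# and seaborn builds the matching legend on its own.
ax = sns.scatterplot(data=toy, x="age", y="income", hue="team")
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```

With plain matplotlib, the same plot would typically need one scatter call per group plus a manually assembled legend.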

Setting up the Environment¶

Seaborn depends on NumPy, pandas and Matplotlib; pip installs them automatically if they are missing. To install the latest release of seaborn, you can use pip:

!pip install seaborn

In [59]:
pip install seaborn
Requirement already satisfied: seaborn in c:\users\user\anaconda3\lib\site-packages (0.13.2)
Requirement already satisfied: numpy!=1.24.0,>=1.20 in c:\users\user\anaconda3\lib\site-packages (from seaborn) (1.26.4)
Requirement already satisfied: pandas>=1.2 in c:\users\user\anaconda3\lib\site-packages (from seaborn) (2.2.2)
Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in c:\users\user\anaconda3\lib\site-packages (from seaborn) (3.9.2)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\user\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.2.0)
Requirement already satisfied: cycler>=0.10 in c:\users\user\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\user\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.51.0)
Requirement already satisfied: kiwisolver>=1.3.1 in c:\users\user\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.4)
Requirement already satisfied: packaging>=20.0 in c:\users\user\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (24.1)
Requirement already satisfied: pillow>=8 in c:\users\user\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (10.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\user\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.1.2)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\user\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in c:\users\user\anaconda3\lib\site-packages (from pandas>=1.2->seaborn) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in c:\users\user\anaconda3\lib\site-packages (from pandas>=1.2->seaborn) (2023.3)
Requirement already satisfied: six>=1.5 in c:\users\user\anaconda3\lib\site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
In [60]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
In [61]:
pd.__version__
Out[61]:
'2.2.2'

Datasets Used for Data Visualization¶

We’ll be working primarily with one dataset, HR_Employee_Attrition_Data.csv.

Preparing the data¶

In [62]:
# importing the dataset
df_HR = pd.read_csv(r'C:\Users\User\OneDrive\Documents\AWP Module\HR_Employee_Attrition_Data.csv')
In [63]:
df_HR.head()
Out[63]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 2 Female 94 3 2 Sales Executive 4 Single 5993 19479 8 Y Yes 11 3 1 80 0 8 0 1 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 2 3 Male 61 2 2 Research Scientist 2 Married 5130 24907 1 Y No 23 4 4 80 1 10 3 3 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 3 4 Male 92 2 1 Laboratory Technician 3 Single 2090 2396 6 Y Yes 15 3 2 80 0 7 3 3 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 4 4 Female 56 3 1 Research Scientist 3 Married 2909 23159 1 Y Yes 11 3 3 80 0 8 3 3 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 5 1 Male 40 3 1 Laboratory Technician 2 Married 3468 16632 9 Y No 12 3 4 80 1 6 3 3 2 2 2 2
In [64]:
df_HR.columns
Out[64]:
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')
In [65]:
pd.set_option('display.max_columns',None)
df_HR.head()
Out[65]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 2 Female 94 3 2 Sales Executive 4 Single 5993 19479 8 Y Yes 11 3 1 80 0 8 0 1 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 2 3 Male 61 2 2 Research Scientist 2 Married 5130 24907 1 Y No 23 4 4 80 1 10 3 3 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 3 4 Male 92 2 1 Laboratory Technician 3 Single 2090 2396 6 Y Yes 15 3 2 80 0 7 3 3 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 4 4 Female 56 3 1 Research Scientist 3 Married 2909 23159 1 Y Yes 11 3 3 80 0 8 3 3 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 5 1 Male 40 3 1 Laboratory Technician 2 Married 3468 16632 9 Y No 12 3 4 80 1 6 3 3 2 2 2 2
In [66]:
df_HR.shape
Out[66]:
(2940, 35)
In [67]:
df_HR_num = df_HR.select_dtypes(include = 'number')
df_HR_cat = df_HR.select_dtypes(include = 'object')
In [68]:
print(df_HR_num.columns)
print(df_HR_cat.columns)
Index(['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome',
       'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager'],
      dtype='object')
Index(['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender',
       'JobRole', 'MaritalStatus', 'Over18', 'OverTime'],
      dtype='object')
In [69]:
for x in df_HR_num.columns:
    if df_HR_num[x].nunique() < 5:
        print(f'Column {x} has {df_HR_num[x].nunique()} values')
Column EmployeeCount has 1 values
Column EnvironmentSatisfaction has 4 values
Column JobInvolvement has 4 values
Column JobSatisfaction has 4 values
Column PerformanceRating has 2 values
Column RelationshipSatisfaction has 4 values
Column StandardHours has 1 values
Column StockOptionLevel has 4 values
Column WorkLifeBalance has 4 values
In [70]:
df_HR_num.drop(['EmployeeCount', 'StandardHours'], inplace = True, axis = 1)
In [71]:
df_HR.columns
Out[71]:
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')
In [72]:
df_HR_num.columns
Out[72]:
Index(['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeNumber',
       'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement', 'JobLevel',
       'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager'],
      dtype='object')
In [73]:
for x in df_HR_cat.columns:
    if df_HR_cat[x].nunique() < 5:
        print(f'Column {x} has {df_HR_cat[x].nunique()} values')
Column Attrition has 2 values
Column BusinessTravel has 3 values
Column Department has 3 values
Column Gender has 2 values
Column MaritalStatus has 3 values
Column Over18 has 1 values
Column OverTime has 2 values
In [74]:
df_HR_cat.drop('Over18', inplace = True, axis = 1)
In [75]:
df_HR_cat.columns
Out[75]:
Index(['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender',
       'JobRole', 'MaritalStatus', 'OverTime'],
      dtype='object')
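The constant-column check done by hand above can also be wrapped in one helper. This is only a sketch: `drop_constant_columns` is a hypothetical name, and the `toy` frame merely imitates a few columns of the HR data.

```python
import pandas as pd


def drop_constant_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame without columns that hold a single unique value."""
    keep = frame.nunique() > 1          # boolean Series indexed by column name
    return frame.loc[:, keep].copy()


# Hypothetical miniature of the HR data: two columns never change.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "StandardHours": [80, 80, 80],
    "Over18": ["Y", "Y", "Y"],
    "Attrition": ["Yes", "No", "Yes"],
})
cleaned = drop_constant_columns(toy)
print(list(cleaned.columns))
```

Returning a new frame, rather than dropping in place on a selection of another frame, also sidesteps pandas' chained-assignment warnings.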
In [76]:
df_HR_num.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2940 entries, 0 to 2939
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Age                       2940 non-null   int64
 1   DailyRate                 2940 non-null   int64
 2   DistanceFromHome          2940 non-null   int64
 3   Education                 2940 non-null   int64
 4   EmployeeNumber            2940 non-null   int64
 5   EnvironmentSatisfaction   2940 non-null   int64
 6   HourlyRate                2940 non-null   int64
 7   JobInvolvement            2940 non-null   int64
 8   JobLevel                  2940 non-null   int64
 9   JobSatisfaction           2940 non-null   int64
 10  MonthlyIncome             2940 non-null   int64
 11  MonthlyRate               2940 non-null   int64
 12  NumCompaniesWorked        2940 non-null   int64
 13  PercentSalaryHike         2940 non-null   int64
 14  PerformanceRating         2940 non-null   int64
 15  RelationshipSatisfaction  2940 non-null   int64
 16  StockOptionLevel          2940 non-null   int64
 17  TotalWorkingYears         2940 non-null   int64
 18  TrainingTimesLastYear     2940 non-null   int64
 19  WorkLifeBalance           2940 non-null   int64
 20  YearsAtCompany            2940 non-null   int64
 21  YearsInCurrentRole        2940 non-null   int64
 22  YearsSinceLastPromotion   2940 non-null   int64
 23  YearsWithCurrManager      2940 non-null   int64
dtypes: int64(24)
memory usage: 551.4 KB
In [77]:
df_HR_cat.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2940 entries, 0 to 2939
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Attrition       2940 non-null   object
 1   BusinessTravel  2940 non-null   object
 2   Department      2940 non-null   object
 3   EducationField  2940 non-null   object
 4   Gender          2940 non-null   object
 5   JobRole         2940 non-null   object
 6   MaritalStatus   2940 non-null   object
 7   OverTime        2940 non-null   object
dtypes: object(8)
memory usage: 183.9+ KB
In [78]:
# pd.set_option('display.max_columns', None)
df_HR_num.describe()
Out[78]:
Age DailyRate DistanceFromHome Education EmployeeNumber EnvironmentSatisfaction HourlyRate JobInvolvement JobLevel JobSatisfaction MonthlyIncome MonthlyRate NumCompaniesWorked PercentSalaryHike PerformanceRating RelationshipSatisfaction StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
count 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000
mean 36.923810 802.485714 9.192517 2.912925 1470.500000 2.721769 65.891156 2.729932 2.063946 2.728571 6502.931293 14313.103401 2.693197 15.209524 3.153741 2.712245 0.793878 11.279592 2.799320 2.761224 7.008163 4.229252 2.187755 4.123129
std 9.133819 403.440447 8.105485 1.023991 848.849221 1.092896 20.325969 0.711440 1.106752 1.102658 4707.155770 7116.575021 2.497584 3.659315 0.360762 1.081025 0.851932 7.779458 1.289051 0.706356 6.125483 3.622521 3.221882 3.567529
min 18.000000 102.000000 1.000000 1.000000 1.000000 1.000000 30.000000 1.000000 1.000000 1.000000 1009.000000 2094.000000 0.000000 11.000000 3.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000
25% 30.000000 465.000000 2.000000 2.000000 735.750000 2.000000 48.000000 2.000000 1.000000 2.000000 2911.000000 8045.000000 1.000000 12.000000 3.000000 2.000000 0.000000 6.000000 2.000000 2.000000 3.000000 2.000000 0.000000 2.000000
50% 36.000000 802.000000 7.000000 3.000000 1470.500000 3.000000 66.000000 3.000000 2.000000 3.000000 4919.000000 14235.500000 2.000000 14.000000 3.000000 3.000000 1.000000 10.000000 3.000000 3.000000 5.000000 3.000000 1.000000 3.000000
75% 43.000000 1157.000000 14.000000 4.000000 2205.250000 4.000000 84.000000 3.000000 3.000000 4.000000 8380.000000 20462.000000 4.000000 18.000000 3.000000 4.000000 1.000000 15.000000 3.000000 3.000000 9.000000 7.000000 3.000000 7.000000
max 60.000000 1499.000000 29.000000 5.000000 2940.000000 4.000000 100.000000 4.000000 5.000000 4.000000 19999.000000 26999.000000 9.000000 25.000000 4.000000 4.000000 3.000000 40.000000 6.000000 4.000000 40.000000 18.000000 15.000000 17.000000
In [79]:
df_HR_cat.describe()
Out[79]:
Attrition BusinessTravel Department EducationField Gender JobRole MaritalStatus OverTime
count 2940 2940 2940 2940 2940 2940 2940 2940
unique 2 3 3 6 2 9 3 2
top No Travel_Rarely Research & Development Life Sciences Male Sales Executive Married No
freq 2466 2086 1922 1212 1764 652 1346 2108

Note above the difference between the output of the describe function on numerical and categorical data. For numerical data, the output consists of statistics such as count, mean, std, min, max and percentiles. For categorical data we instead get the count, the number of unique levels, the most frequent level (top) and its frequency (freq).
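The difference is easy to reproduce on a tiny frame. The sketch below uses a hypothetical four-row `toy` frame; only the row labels of the two summaries matter here.

```python
import pandas as pd

# Hypothetical four-row frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "MonthlyIncome": [5993, 5130, 2090, 2909],
    "Department": ["Sales", "Research", "Research", "Research"],
})

numeric_summary = toy[["MonthlyIncome"]].describe()
categorical_summary = toy[["Department"]].describe()

print(list(numeric_summary.index))      # statistics reported for numbers
print(list(categorical_summary.index))  # statistics reported for categories
print(categorical_summary.loc["top", "Department"],
      categorical_summary.loc["freq", "Department"])
```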

Data Visualization using Seaborn¶

This implementation section is divided into two categories:

● Visualizing statistical relationships

● Plotting categorical data

We’ll look at multiple examples of each category and how to plot it using seaborn.

Visualizing statistical relationships¶

A statistical relationship describes how the variables in a dataset relate to one another and how those relationships depend on other variables.

Scatterplot using Seaborn¶

A scatterplot is perhaps the most common way to visualize the relationship between two variables. Each point represents one observation in the dataset, and the cloud of points shows the joint distribution of the two variables. To draw the scatter plot, we’ll use the relplot() function of the seaborn library. It is a figure-level function for visualizing statistical relationships. By default, relplot() produces a scatter plot:

In [80]:
df_HR_num.columns
Out[80]:
Index(['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeNumber',
       'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement', 'JobLevel',
       'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager'],
      dtype='object')
In [81]:
# relplot, catplot, displot, pairplot, jointplot
In [82]:
sns.relplot(x = 'Age', y = 'MonthlyIncome', data = df_HR_num, height = 6, aspect = 1.5) 
plt.show()
In [83]:
# show hue='JobSatisfaction', size='JobLevel', style='PerformanceRating', palette
sns.relplot(x = 'Age', y = 'MonthlyIncome',hue='JobSatisfaction',size='JobLevel',style='PerformanceRating',data = df_HR_num, height = 6, aspect = 1.5,palette='bright') 
plt.show()
In [84]:
sns.relplot(x = 'Age', y = 'MonthlyIncome',hue='JobSatisfaction',size='PerformanceRating',data = df_HR_num, height = 6, aspect = 1.5,palette='mako') 
plt.show()
In [85]:
df_HR_num['PerformanceRating'].value_counts()
Out[85]:
PerformanceRating
3    2488
4     452
Name: count, dtype: int64

Note above how we used the height parameter to set the plot height instead of plt.figure(figsize=...). Aspect is the ratio of width to height, so a height of 6 with an aspect of 1.5 gives a width of 9 (in inches).

The figure-level functions in seaborn, such as relplot(), catplot() and lmplot(), take height and aspect as parameters for setting the dimensions of the plot.
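The width arithmetic can be checked directly on the returned grid. A small sketch, assuming a seaborn version that exposes `FacetGrid.figure` (older releases used `.fig`) and a hypothetical random `toy` frame:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical random data; only the figure dimensions matter here.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 60, 50),
    "income": rng.integers(1000, 20000, 50),
})

g = sns.relplot(data=toy, x="age", y="income", height=6, aspect=1.5)

# height is in inches; width = height * aspect for a single facet.
width, height = g.figure.get_size_inches()
print(width, height)
```

No hue is used here on purpose: when seaborn places a legend outside the axes it may widen the figure beyond height * aspect.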

In [86]:
df_HR_num.shape
Out[86]:
(2940, 24)
In [87]:
df_HR_cat.shape
Out[87]:
(2940, 8)
In [88]:
# show  size hue,style ,palette

Here we have also specified the palette of colors to be used. More on seaborn's palettes can be found here:

https://seaborn.pydata.org/tutorial/color_palettes.html

Finally, in the second plot above we split the points further by 'PerformanceRating', using a different marker symbol for each value through the 'style' parameter.

Seaborn has many parameters for each plot. For example, the symbols used by the style parameter can be chosen and ordered as required (through the markers and style_order parameters); the accepted markers are matplotlib's marker codes, as covered in the matplotlib class. As you go through your projects and grow as a Data Scientist, feel free to experiment with these further parameters. You will eventually build your own preferred set of options and mostly re-use them. However, it is always good to know what else is available.


We can use the sns.palplot() function to view the colors of a palette.

In [89]:
palettePastel = sns.color_palette('pastel')
paletteDeep = sns.color_palette('deep')
paletteSet2 = sns.color_palette('Set2')
paletteMako = sns.color_palette('mako')
paletteMakoSeq = sns.color_palette("mako", as_cmap=True)
sns.palplot(palettePastel)
sns.palplot(paletteDeep)
sns.palplot(paletteSet2)
sns.palplot(paletteMako)
#sns.palplot(paletteMakoSeq)
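Besides viewing a palette, it can help to inspect it as data. The sketch below shows the two forms `sns.color_palette()` can return: a list of colors, or a continuous colormap when `as_cmap=True` is passed. The variable names are illustrative only.

```python
import seaborn as sns
from matplotlib.colors import Colormap

# A discrete palette: a list-like object of RGB tuples.
mako_four = sns.color_palette("mako", n_colors=4)
print(mako_four.as_hex())          # the same colors as hex strings

# A continuous colormap, suitable for numeric hue variables or heatmaps.
mako_cmap = sns.color_palette("mako", as_cmap=True)
print(isinstance(mako_cmap, Colormap))
```

This is also why the commented-out palplot call in the cell above would not work as written: palplot expects a sequence of colors, not a colormap object.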
In [90]:
sns.relplot(data=df_HR_num,x='Age',y='MonthlyIncome',height=8,aspect=1.5,kind='line',errorbar=None)
Out[90]:
<seaborn.axisgrid.FacetGrid at 0x1cfc0c12d80>
In [91]:
plt.figure(figsize=(10,7))
sns.scatterplot(data=df_HR.head(100),x='Age',y='MonthlyIncome',hue='Attrition',palette='bright',\
           size='JobLevel',style='JobSatisfaction')
plt.show()
In [92]:
plt.figure(figsize=(10,10))
sns.scatterplot(x = 'Age', y = 'MonthlyIncome',hue='JobSatisfaction',size='PerformanceRating',data = df_HR_num,palette='mako') 
plt.show()

sns.scatterplot() is an axes-level function and does not accept height or aspect, so we set the figure size with plt.figure(figsize=(a, b)) before calling it.

In [93]:
# sns.relplot(data, x, y, kind='line' / 'scatter')
# sns.scatterplot(data, x, y)
In [94]:
plt.figure(figsize = (8,8))
sns.lineplot(data = df_HR, x = 'Age', y = 'YearsAtCompany',errorbar=None)
plt.show()

sns.relplot()¶

  • sns.lineplot = sns.relplot(kind = 'line')
  • sns.scatterplot = sns.relplot(kind = 'scatter') (DEFAULT Relplot)
In [95]:
import warnings
warnings.filterwarnings('ignore')
In [96]:
sns.relplot(data = df_HR, x = 'Age', y = 'YearsAtCompany', height = 8, aspect = 1.5, kind = 'line',errorbar=None,hue='Attrition')
plt.show()
In [97]:
# Lineplot age vs years at company
sns.relplot(data=df_HR,x='Age',y='YearsAtCompany',kind='line',aspect=1.2,height=8,errorbar=None)
Out[97]:
<seaborn.axisgrid.FacetGrid at 0x1cfc97b9790>
In [98]:
sns.relplot(data=df_HR,x='Age',y='PercentSalaryHike',kind='line',aspect=1.5,height=8,errorbar=None,hue='Attrition')
Out[98]:
<seaborn.axisgrid.FacetGrid at 0x1cfcb711880>
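In [99]:
# show hue='Attrition', errorbar=None (errorbar replaces the older ci parameter)

By changing the kind to 'line' we can use the sns.relplot() function to draw line plots. Seaborn also has the scatterplot() and lineplot() functions to draw these same plots. The plotting parameters (x, y, hue and so on) remain the same, but these axes-level functions take their figure size from matplotlib rather than from height and aspect.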

lmplot()¶

The lmplot() function in seaborn plots a scatterplot with a regression line overlaid.

In [100]:
sns.lmplot(data=df_HR,x='Age',y='MonthlyIncome',aspect=1.2,height=8) # best fit line for linear Regression
Out[100]:
<seaborn.axisgrid.FacetGrid at 0x1cfc96cd490>
In [101]:
sns.residplot(data=df_HR,x='Age',y='MonthlyIncome')
Out[101]:
<Axes: xlabel='Age', ylabel='MonthlyIncome'>
In [102]:
sns.lmplot(data = df_HR, x = 'YearsAtCompany', y = 'MonthlyIncome', height = 6, aspect = 2)
Out[102]:
<seaborn.axisgrid.FacetGrid at 0x1cfcc71a060>
In [103]:
plt.figure(figsize = (10,6))
sns.residplot(data = df_HR, x = 'YearsAtCompany', y = 'MonthlyIncome') # residuals (errors) of the linear fit
# points above the dotted line are positive residuals, points below it are negative residuals
Out[103]:
<Axes: xlabel='YearsAtCompany', ylabel='MonthlyIncome'>
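To make the regression line and its residuals concrete, here is a NumPy-only sketch of the same idea: fit a straight line by least squares, then subtract the fitted values from the observations. The data are synthetic and the coefficients (2000 and 300) are arbitrary assumptions; seaborn's own fitting code is not reproduced here.

```python
import numpy as np

# Synthetic data: a straight line plus random noise.
rng = np.random.default_rng(1)
x = np.linspace(0, 40, 200)
y = 2000 + 300 * x + rng.normal(0, 500, x.size)

# Least-squares straight line, the kind of fit lmplot overlays by default.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x

# Residual = observed - fitted; this is what residplot puts on the y-axis.
residuals = y - fitted
print(slope, intercept, residuals.mean())
```

For a least-squares line with an intercept, the residuals average out to zero, which is why a residual plot is centered on the dotted zero line. Visible structure around that line suggests a straight line is not a good model.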

Plotting Categorical Data¶

In the section above, we saw how different visual representations show the relationship between variables, and the plots we drew were between two numeric variables. In this section, we’ll look at the relationship between two variables of which one is categorical (divided into different groups). We’ll be using the catplot() function of the seaborn library to draw plots of categorical data. In older seaborn releases this function was called factorplot() and drew a point plot by default; it was renamed catplot() in version 0.9, where the default became a strip plot.

In [104]:
df_HR.head()
Out[104]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 2 Female 94 3 2 Sales Executive 4 Single 5993 19479 8 Y Yes 11 3 1 80 0 8 0 1 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 2 3 Male 61 2 2 Research Scientist 2 Married 5130 24907 1 Y No 23 4 4 80 1 10 3 3 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 3 4 Male 92 2 1 Laboratory Technician 3 Single 2090 2396 6 Y Yes 15 3 2 80 0 7 3 3 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 4 4 Female 56 3 1 Research Scientist 3 Married 2909 23159 1 Y Yes 11 3 3 80 0 8 3 3 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 5 1 Male 40 3 1 Laboratory Technician 2 Married 3468 16632 9 Y No 12 3 4 80 1 6 3 3 2 2 2 2
catplot¶
In [105]:
sns.catplot(data=df_HR,y='MonthlyIncome',kind='box',hue='Attrition',col='Department')
Out[105]:
<seaborn.axisgrid.FacetGrid at 0x1cfc9879a30>
In [106]:
sns.catplot(data=df_HR,x='MonthlyIncome',kind='box',y='JobRole',aspect=1.3,height=8)
Out[106]:
<seaborn.axisgrid.FacetGrid at 0x1cfcde277d0>
In [107]:
sns.catplot(data=df_HR,x='MonthlyIncome',kind='box',y='JobRole',aspect=1.3,height=8,hue='Attrition',col='Gender')
Out[107]:
<seaborn.axisgrid.FacetGrid at 0x1cfcc783c80>
In [108]:
# strip plot with jitter switched off
sns.catplot(data=df_HR,y='DistanceFromHome',x='Attrition',jitter=False) # the default kind is 'strip'
Out[108]:
<seaborn.axisgrid.FacetGrid at 0x1cfce55d490>
In [109]:
sns.catplot(data=df_HR,y='DistanceFromHome',x='JobSatisfaction',kind='violin',hue='Attrition',split=True,
            inner='quartile',aspect=2,height=6,col='Department')
Out[109]:
<seaborn.axisgrid.FacetGrid at 0x1cfc97f8770>
In [110]:
df_HR_cat.columns
Out[110]:
Index(['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender',
       'JobRole', 'MaritalStatus', 'OverTime'],
      dtype='object')
In [111]:
sns.catplot(x = 'Attrition', y = 'YearsAtCompany', data = df_HR)    # strip plot (the default kind)
plt.show()

The above is a strip plot (also called a jitter plot) showing the points corresponding to the x and y values. The points are spread out horizontally because each one is shifted by a small random amount from its true x position (i.e. jittering), so that they don't overlap completely. If we set jitter to False, every point is drawn exactly at its x tick, and observations with the same y value fall on top of each other.
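The effect of jitter can be measured by reading the plotted x coordinates back from the Axes. This is a sketch under assumptions: `x_spread` is a hypothetical helper, the `toy` data are random, and it relies on seaborn drawing strip plots as matplotlib scatter collections.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: 30 observations per category.
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "Attrition": ["Yes"] * 30 + ["No"] * 30,
    "YearsAtCompany": rng.integers(0, 20, 60),
})


def x_spread(jitter):
    """Return the horizontal spread (max x - min x) of each drawn point group."""
    fig, ax = plt.subplots()
    sns.stripplot(data=toy, x="Attrition", y="YearsAtCompany", jitter=jitter, ax=ax)
    spreads = []
    for collection in ax.collections:
        xs = np.asarray(collection.get_offsets())[:, 0]
        if xs.size:
            spreads.append(float(xs.max() - xs.min()))
    plt.close(fig)
    return spreads


no_jitter = x_spread(False)
with_jitter = x_spread(True)
print(no_jitter, with_jitter)
```

Only the x coordinates change between the two calls; the y values are never altered by jitter.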

In [112]:
sns.catplot(x = 'Attrition', y = 'YearsAtCompany', data = df_HR,jitter=False)    # strip plot without jitter
plt.show()
In [113]:
#show jitter=True

The kind parameter in catplot takes values including the following:

"strip" - the default, seen above. Direct func - sns.stripplot()

"swarm" - similar to a strip plot with jitter, but the points are shifted along the categorical axis just enough that they do not overlap. Direct func - sns.swarmplot()

"box" - Shows a statistical summary of the chosen y for each x tick. The box spans the IQR, from the 25% value to the 75% value, with a line at the 50% value (the median). The whiskers reach the most extreme points within 1.5 IQR of the box, and points beyond them are drawn as outliers. Direct func - sns.boxplot()

"violin" - Distribution of the y points for each x tick, drawn as a density curve mirrored on both sides of the center line, with a small box plot in the center. Direct func - sns.violinplot()

"point" - Shows the point estimate (the mean by default) as a point for each x tick; the lines above and below the point show the level of uncertainty around that estimate. Direct func - sns.pointplot()

"bar" - Bar plot: the height of each bar is the point estimate (the mean by default) of y for that x. Direct func - sns.barplot()

"count" - Count plot: the frequency of each value of x. Direct func - sns.countplot()

From the documentation of catplot, we can see that each of these kinds also has its own plotting function, i.e.

instead of calling sns.catplot(data=df, x=..., y=..., kind='swarm') we could call

sns.swarmplot(data=df, x=..., y=...) with the same plotting parameters. The direct functions are axes-level, so they do not accept height, aspect or col.
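The equivalence can be checked on a small example by comparing the bars that each call draws. The sketch uses kind='count' on a hypothetical six-row `toy` frame and assumes that the bars are stored as patches on the Axes.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data: 3 x Research, 2 x Sales, 1 x HR.
toy = pd.DataFrame({"Department": ["Sales", "Research", "Research", "HR", "Research", "Sales"]})

# Figure-level call with kind='count' ...
grid = sns.catplot(data=toy, x="Department", kind="count")

# ... and the direct axes-level function on an Axes we create ourselves.
fig, ax = plt.subplots()
sns.countplot(data=toy, x="Department", ax=ax)

grid_heights = sorted(patch.get_height() for patch in grid.ax.patches)
axes_heights = sorted(patch.get_height() for patch in ax.patches)
print(grid_heights, axes_heights)
```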

In [114]:
sns.catplot(x = 'JobSatisfaction', y = 'DistanceFromHome', data = df_HR, kind = 'strip', height = 8, aspect = 1.5)
plt.show()
In [162]:
sns.catplot(data=df_HR,x='Attrition',y='Age',kind='strip',jitter=False,col='Gender') # strip plot and jitter plot are the same kind; jitter is a parameter
Out[162]:
<seaborn.axisgrid.FacetGrid at 0x1cfc1a17b30>
In [115]:
# take hue=attrition
sns.catplot(x = 'JobSatisfaction', y = 'DistanceFromHome', data = df_HR,hue='Attrition', kind = 'strip', height = 8, aspect = 1.5)
plt.show()
In [116]:
sns.catplot(x = 'MonthlyIncome', data = df_HR, kind = 'box', height = 8, aspect = 2)
plt.show()
In [117]:
sns.catplot(x = 'Age', data = df_HR, kind = 'box', height = 8, aspect = 2)
plt.show()
In [118]:
sns.catplot(y = 'Age', x='Department',data = df_HR, kind = 'box', height = 8, aspect = 2,hue='Attrition')
plt.show()
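The numbers behind a box plot can be worked out by hand. The sketch below uses eight made-up income values and the usual 1.5 IQR rule for the whisker limits; the variable names are illustrative.

```python
import pandas as pd

# Eight made-up values with one obvious outlier.
incomes = pd.Series([2000, 2500, 3000, 3500, 4000, 4500, 5000, 20000])

# The box: 25%, 50% (median) and 75% values.
q1, median, q3 = incomes.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Whiskers stop at the last data points inside these fences;
# anything outside is drawn as an individual outlier point.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = incomes[(incomes < lower_fence) | (incomes > upper_fence)]

print(q1, median, q3, iqr)
print(lower_fence, upper_fence, outliers.tolist())
```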
In [119]:
df_HR.columns
Out[119]:
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')
Violin plots¶

Violin plots combine a box plot with a kernel density estimate to give a richer description of the distribution of values. By default a small box plot is drawn inside each violin; with inner='quartile' the quartile values are drawn as lines instead.
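The outline of a violin is a kernel density estimate: a smooth curve built by placing a small bell curve on every observation and adding them up. Here is a NumPy-only sketch of a Gaussian KDE with Scott's rule for the bandwidth. `gaussian_kde_1d` is a hypothetical helper written for illustration, and seaborn's internal implementation may differ in detail.

```python
import numpy as np


def gaussian_kde_1d(samples, grid):
    """Evaluate a Gaussian kernel density estimate of samples on grid."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    # Scott's rule for one-dimensional data.
    bandwidth = samples.std(ddof=1) * n ** (-1 / 5)
    # One Gaussian bump per sample, evaluated at every grid point.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (n * bandwidth)


# Synthetic sample standing in for a numeric column.
rng = np.random.default_rng(3)
samples = rng.normal(loc=10, scale=3, size=500)
grid = np.linspace(samples.min() - 5, samples.max() + 5, 400)

density = gaussian_kde_1d(samples, grid)
area = float(density.sum() * (grid[1] - grid[0]))   # should be close to 1
print(area)
```

A violin plot draws this curve twice, mirrored around the center line, so the width of the violin at any y value reflects how many observations lie near it.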

In [120]:
sns.catplot(x = 'MonthlyIncome', data = df_HR, kind = 'violin', height = 6, aspect = 2)
plt.show()
In [165]:
# swarmplot,violinplot
sns.catplot(data=df_HR,x='JobSatisfaction',y='DistanceFromHome',kind='violin',aspect=1.2,hue='Attrition',split=True,inner='quartile',\
            palette='muted')
Out[165]:
<seaborn.axisgrid.FacetGrid at 0x1cfdfdc39b0>
In [166]:
sns.catplot(data=df_HR,x='JobSatisfaction',y='DistanceFromHome',kind='violin',aspect=1.2,hue='Attrition',split=True,inner='quartile',\
            palette='muted',col='Gender')
Out[166]:
<seaborn.axisgrid.FacetGrid at 0x1cfdfe5a300>

We can also overlay one plot on another as we did with matplotlib. For this we use the axes-level functions, which draw onto the current matplotlib figure, instead of catplot().

In [121]:
plt.figure(figsize = (16,10))
sns.swarmplot(x = 'JobSatisfaction', y = 'DistanceFromHome', data = df_HR, palette = 'bright')
sns.violinplot(x = 'JobSatisfaction', y = 'DistanceFromHome', data = df_HR, palette = 'pastel')
plt.show()
In [122]:
sns.catplot(data=df_HR,x='Department',kind='count',height=6,aspect=2,hue='Attrition',col='JobSatisfaction')
Out[122]:
<seaborn.axisgrid.FacetGrid at 0x1cfc0e7df70>
In [123]:
sns.catplot(x = 'JobSatisfaction', y = 'DistanceFromHome', hue = 'Attrition', data = df_HR, kind = 'violin', height = 6,\
            aspect = 2)
plt.show()

Setting split = True draws the two Attrition distributions (Yes and No) on opposite halves of the same violin, which makes them easier to compare.

In [124]:
#split=True
sns.catplot(x = 'JobSatisfaction', y = 'DistanceFromHome', hue = 'Attrition', data = df_HR, kind = 'violin', height = 6,\
            aspect = 2,split=True)
plt.show()

However, we still get only a single box plot in the center of each violin, covering both attrition groups together. We can fix this with inner = 'quartile', which draws lines at the 25%, 50% (median) and 75% points of each half.

In [125]:
# inner='quartile'
#split=True
sns.catplot(x = 'JobSatisfaction', y = 'DistanceFromHome', hue = 'Attrition', data = df_HR, kind = 'violin', height = 6,\
            aspect = 2,split=True,inner='quartile')
plt.show()
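
As a rough check on what quartile lines represent, the sketch below computes the 25%, 50% and 75% points for a small made-up list of distances (the values are hypothetical, not taken from df_HR). It uses statistics.quantiles from the standard library with method='inclusive', which interpolates linearly between data points; that seaborn's quartile lines follow the same convention is an assumption here.

```python
import statistics

# Hypothetical commuting distances, for illustration only
distances = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into four equal parts and returns the three cut points
q1, median, q3 = statistics.quantiles(distances, n=4, method='inclusive')

print(q1, median, q3)
```

Half of the values lie between the first and third of these cut points, which is the region the outer quartile lines enclose in each half of a violin.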
Pointplot¶

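A point plot shows a point estimate (by default the mean) of the y variable for each x category, with error bars indicating the uncertainty of that estimate, and joins the points that belong to the same hue category with a line. This makes it easy to see how the relationship changes within each hue category.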

In [126]:
sns.catplot(x = 'PercentSalaryHike', y = 'JobSatisfaction',hue = 'Attrition', data = df_HR, kind = 'point', height = 6,\
           aspect = 1.5)
plt.show()
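
To make the point estimate concrete, here is a minimal sketch of the grouping a point plot performs before drawing: one mean of y for every combination of x category and hue category. The records are hypothetical and only mimic the PercentSalaryHike, Attrition and JobSatisfaction columns; the error bars that seaborn adds are not computed here.

```python
from collections import defaultdict

# Hypothetical (PercentSalaryHike, Attrition, JobSatisfaction) records
records = [
    (11, 'Yes', 2), (11, 'Yes', 4), (11, 'No', 3),
    (12, 'Yes', 1), (12, 'No', 4), (12, 'No', 2),
]

# Collect the y values for every (x category, hue category) pair
groups = defaultdict(list)
for hike, attrition, satisfaction in records:
    groups[(hike, attrition)].append(satisfaction)

# The point drawn for each pair is the mean of its y values
point_estimates = {key: sum(vals) / len(vals) for key, vals in groups.items()}

for (hike, attrition), mean in sorted(point_estimates.items()):
    print(hike, attrition, mean)
```

The line for one hue category then simply connects its means from one x category to the next.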
In [170]:
# countplot 
sns.catplot(data=df_HR,x='Education',kind='count',hue='Attrition',col='Gender')
Out[170]:
<seaborn.axisgrid.FacetGrid at 0x1cfc96ba570>
In [171]:
sns.catplot(data=df_HR,x='EnvironmentSatisfaction',y='TotalWorkingYears',kind='bar',hue='Attrition',errorbar=None)
Out[171]:
<seaborn.axisgrid.FacetGrid at 0x1cfceee47a0>
In [127]:
sns.catplot(x = 'JobSatisfaction', y = 'DistanceFromHome', hue = 'Attrition', data = df_HR, kind = 'bar', errorbar = None)   # errorbar = None hides the error bars
plt.show()
In [128]:
sns.catplot(x = 'JobSatisfaction', data = df_HR,hue='Attrition',kind = 'count',height = 6, aspect = 2)
plt.show()
In [129]:
sns.catplot(x = 'Attrition', data = df_HR, kind = 'count',height = 6, aspect = 0.5,col='JobSatisfaction')
plt.show()

With the col parameter we can add one more dimension to a seaborn plot: the data is split into one panel per value of the chosen column.

In [130]:
# hue = Attrition, col = JobSatisfaction
In [131]:
df_HR.head()
Out[131]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 2 Female 94 3 2 Sales Executive 4 Single 5993 19479 8 Y Yes 11 3 1 80 0 8 0 1 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 2 3 Male 61 2 2 Research Scientist 2 Married 5130 24907 1 Y No 23 4 4 80 1 10 3 3 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 3 4 Male 92 2 1 Laboratory Technician 3 Single 2090 2396 6 Y Yes 15 3 2 80 0 7 3 3 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 4 4 Female 56 3 1 Research Scientist 3 Married 2909 23159 1 Y Yes 11 3 3 80 0 8 3 3 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 5 1 Male 40 3 1 Laboratory Technician 2 Married 3468 16632 9 Y No 12 3 4 80 1 6 3 3 2 2 2 2

Visualizing the Distribution of a Dataset¶

Univariate distributions - Histograms¶
In [132]:
sns.displot(data = df_HR, x = 'MonthlyIncome', height = 6, aspect = 1.5,kde=True)
Out[132]:
<seaborn.axisgrid.FacetGrid at 0x1cfcf0d7ce0>
In [133]:
sns.displot(data = df_HR, x = 'MonthlyIncome', height = 6, aspect = 1.5,kde=True,col='JobSatisfaction')
Out[133]:
<seaborn.axisgrid.FacetGrid at 0x1cfd7c67020>

As with relplot and catplot, displot takes a kind parameter. It accepts the following values:

hist - The data is divided into bins and the count in each bin is drawn as a bar. This is the default. Direct function: histplot().

kde - Kernel density estimate: a smooth estimate of the probability density of the variable. Passing kde = True to a histogram draws both the bars and the KDE curve. Direct function: kdeplot().

ecdf - Empirical cumulative distribution function: shows, for every value, the proportion of data points at or below it (explained further below). Direct function: ecdfplot().

A rug is not a kind of its own. It can be added to any of the plots above with rug = True, or drawn separately with rugplot(). It draws a tick along the x axis for each data point, so the density at different points in the data can be judged from how closely the ticks are packed.

Please remember that, as used here with only x assigned, these graphs show univariate distributions.
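
To illustrate what a kernel density estimate does, the following sketch places a Gaussian "bump" on every data point and averages the bumps. The income values and the bandwidth of 1000 are hypothetical choices for illustration; seaborn picks its bandwidth automatically, and this is not its implementation.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian KDE of `samples` at x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum of one Gaussian kernel per sample, scaled so the total area is 1
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Hypothetical monthly incomes, for illustration only
incomes = [2000, 2500, 3000, 5000, 9000]
density = gaussian_kde(incomes, bandwidth=1000)

# The density is higher where the data points are packed closely together
print(density(3000) > density(7000))
```

A smaller bandwidth gives a bumpier curve that follows individual points, while a larger one smooths the detail away.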

In [134]:
# Kernel density estimate (KDE) of MonthlyIncome, without a rug
sns.displot(data = df_HR, x = 'MonthlyIncome', height = 6, aspect = 2.5, kind = 'kde', rug = False)
Out[134]:
<seaborn.axisgrid.FacetGrid at 0x1cfd5678a40>
In [173]:
sns.displot(data = df_HR, x = 'MonthlyIncome', kind = 'hist', kde=True)
Out[173]:
<seaborn.axisgrid.FacetGrid at 0x1cfca75a750>
In [178]:
sns.displot(data = df_HR, x = 'MonthlyIncome', kind = 'kde', rug = True,col='Gender')
Out[178]:
<seaborn.axisgrid.FacetGrid at 0x1cfd7c64200>
In [175]:
plt.figure(figsize = (12,8))
sns.rugplot(data = df_HR, x = 'MonthlyIncome', height = 1)
plt.show()
In [179]:
sns.rugplot(data = df_HR, x = 'MonthlyIncome', height = 1) # closely packed ticks indicate high density
Out[179]:
<Axes: xlabel='MonthlyIncome'>
In [136]:
# displot() is figure-level and creates its own figure, so plt.figure() has no effect on it.
# Size each plot with height and aspect instead.
sns.displot(data = df_HR, x = 'MonthlyIncome', kind = 'ecdf', height = 6, aspect = 1.5)
sns.displot(data = df_HR, x = 'MonthlyIncome', kind = 'kde', height = 6, aspect = 1.5)
plt.show()
In [137]:
sns.displot(data = df_HR, x = 'MonthlyIncome', height = 6, aspect = 1.5, kind = 'ecdf', col = 'Attrition')
Out[137]:
<seaborn.axisgrid.FacetGrid at 0x1cfdd6b6a80>

ECDF stands for empirical cumulative distribution function. The ECDF plot represents every data point of the dataset directly, in a cumulative manner.

Unlike a histogram or a KDE, it needs no bin size or smoothing parameter, so no detail is lost to the choice of such a setting.

Because its curves are monotonically increasing, it is well suited for comparing multiple distributions at the same time. In an ECDF plot, the x-axis corresponds to the range of values of the variable, whereas the y-axis corresponds to the proportion of data points that are less than or equal to the corresponding x value.
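
The definition above can be sketched in a few lines: the ECDF at x is the number of data points less than or equal to x, divided by the total number of points. The income values and the helper name ecdf_at are hypothetical, used only for illustration.

```python
from bisect import bisect_right

def ecdf_at(sorted_values, x):
    """Proportion of data points that are less than or equal to x."""
    # bisect_right counts how many sorted values are <= x
    return bisect_right(sorted_values, x) / len(sorted_values)

# Hypothetical monthly incomes, for illustration only
incomes = sorted([3000, 1000, 2000, 2000, 5000])

for x in (999, 2000, 2500, 5000):
    print(x, ecdf_at(incomes, x))
```

The value never decreases as x grows, which is why the curve in an ECDF plot only steps upward, from 0 to 1.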

Plotting Bivariate Distributions¶

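Apart from visualizing the distribution of a single variable, we can see how two variables are distributed with respect to each other. A bivariate distribution is the joint distribution of two variables; to visualize it, we use the jointplot() function of the seaborn library, which draws the joint distribution in the main panel and the distribution of each variable on its own along the margins. By default, jointplot draws a scatter plot.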

In [138]:
df_HR.shape
Out[138]:
(2940, 35)
In [139]:
sns.jointplot(data = df_HR, x = 'YearsAtCompany', y = 'YearsSinceLastPromotion', height = 8, ratio = 3)
plt.show()
In [140]:
sns.jointplot(data = df_HR, x = 'YearsAtCompany', y = 'YearsSinceLastPromotion', height = 8, ratio = 3, \
             hue = 'JobSatisfaction', palette = 'bright')
plt.show()

The kind parameter takes the following values:

scatter - Default of jointplot.

kde - Uses a kernel density estimate on the joint axes (the main panel in the plots above).

hist - Uses a histogram on the joint axes.

hex - A hexbin plot is the bivariate analog of a histogram: it shows the number of observations that fall within hexagonal bins. It works well with large datasets.

reg - Plots the data along with a linear regression model fit. The line across the graph is the 'line of best fit', which we shall learn more about in the Stats and ML module. Direct function: regplot().

resid - Plots the residuals of the linear regression model, which we shall learn about in Stats and ML. The dotted horizontal line through 0 on the y axis corresponds to the line of best fit, with the residuals on either side of it. Direct function: residplot().
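
As a small preview of what the 'reg' and 'resid' kinds are built on, the sketch below fits a least-squares line to a handful of made-up points and computes the residuals (observed y minus fitted y). The data and the helper name fit_line are hypothetical; this shows the ordinary least-squares formulas, not seaborn's internal code.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)

# Residual = observed value minus the value predicted by the fitted line
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(slope, intercept)
print(residuals)
```

The residuals of a least-squares fit with an intercept sum to zero, which is why a residual plot scatters around the horizontal line at 0.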

In [141]:
sns.jointplot(data = df_HR, x = 'YearsAtCompany', y = 'Age', height = 8, ratio = 3, kind = 'kde')
plt.show()
In [183]:
sns.jointplot(data = df_HR, x = 'Age', y = 'MonthlyIncome',kind='scatter')
plt.show()
In [185]:
sns.jointplot(data = df_HR, x = 'Age', y = 'MonthlyIncome', kind = 'hex', color = 'red')  # palette is only used together with hue, so it is dropped here
plt.show()
In [142]:
sns.jointplot(data = df_HR, x = 'YearsAtCompany', y = 'Age', height = 8, ratio = 3, kind = 'hist')
plt.show()
In [143]:
sns.jointplot(data = df_HR, x = 'YearsAtCompany', y = 'Age', height = 8, ratio = 3, kind = 'hex')
plt.show()
In [144]:
sns.jointplot(data = df_HR, x = 'YearsAtCompany', y = 'Age', height = 8, ratio = 3, kind = 'reg')
plt.show()
In [145]:
sns.jointplot(data = df_HR, x = 'YearsAtCompany', y = 'Age', height = 8, ratio = 3, kind = 'resid')
plt.show()

Other maps in Seaborn¶

In [146]:
df_HR_num.describe()
Out[146]:
Age DailyRate DistanceFromHome Education EmployeeNumber EnvironmentSatisfaction HourlyRate JobInvolvement JobLevel JobSatisfaction MonthlyIncome MonthlyRate NumCompaniesWorked PercentSalaryHike PerformanceRating RelationshipSatisfaction StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
count 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000 2940.000000
mean 36.923810 802.485714 9.192517 2.912925 1470.500000 2.721769 65.891156 2.729932 2.063946 2.728571 6502.931293 14313.103401 2.693197 15.209524 3.153741 2.712245 0.793878 11.279592 2.799320 2.761224 7.008163 4.229252 2.187755 4.123129
std 9.133819 403.440447 8.105485 1.023991 848.849221 1.092896 20.325969 0.711440 1.106752 1.102658 4707.155770 7116.575021 2.497584 3.659315 0.360762 1.081025 0.851932 7.779458 1.289051 0.706356 6.125483 3.622521 3.221882 3.567529
min 18.000000 102.000000 1.000000 1.000000 1.000000 1.000000 30.000000 1.000000 1.000000 1.000000 1009.000000 2094.000000 0.000000 11.000000 3.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000
25% 30.000000 465.000000 2.000000 2.000000 735.750000 2.000000 48.000000 2.000000 1.000000 2.000000 2911.000000 8045.000000 1.000000 12.000000 3.000000 2.000000 0.000000 6.000000 2.000000 2.000000 3.000000 2.000000 0.000000 2.000000
50% 36.000000 802.000000 7.000000 3.000000 1470.500000 3.000000 66.000000 3.000000 2.000000 3.000000 4919.000000 14235.500000 2.000000 14.000000 3.000000 3.000000 1.000000 10.000000 3.000000 3.000000 5.000000 3.000000 1.000000 3.000000
75% 43.000000 1157.000000 14.000000 4.000000 2205.250000 4.000000 84.000000 3.000000 3.000000 4.000000 8380.000000 20462.000000 4.000000 18.000000 3.000000 4.000000 1.000000 15.000000 3.000000 3.000000 9.000000 7.000000 3.000000 7.000000
max 60.000000 1499.000000 29.000000 5.000000 2940.000000 4.000000 100.000000 4.000000 5.000000 4.000000 19999.000000 26999.000000 9.000000 25.000000 4.000000 4.000000 3.000000 40.000000 6.000000 4.000000 40.000000 18.000000 15.000000 17.000000
In [147]:
# df_HR_num.drop('EmployeeNumber', inplace = True, axis = 1)

Heatmaps¶

Heatmaps are graphical representations of data that use color to encode values. They are typically used for values that lie on a fixed scale, where a gradual change in the shade of a single color makes it easy to pick out the higher and lower values.

Of course, we can also choose a multi-color cmap to represent the data, but the result usually isn't as easy to read.

In [149]:
df_HR_num.corr()
Out[149]:
Age DailyRate DistanceFromHome Education EmployeeNumber EnvironmentSatisfaction HourlyRate JobInvolvement JobLevel JobSatisfaction MonthlyIncome MonthlyRate NumCompaniesWorked PercentSalaryHike PerformanceRating RelationshipSatisfaction StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
Age 1.000000 0.010661 -0.001686 0.208034 -0.005175 0.010146 0.024287 0.029820 0.509604 -0.004892 0.497855 0.028051 0.299635 0.003634 0.001904 0.053535 0.037510 0.680381 -0.019621 -0.021490 0.311309 0.212901 0.216513 0.202089
DailyRate 0.010661 1.000000 -0.004985 -0.016806 -0.025742 0.018355 0.023381 0.046135 0.002966 0.030571 0.007707 -0.032182 0.038153 0.022704 0.000473 0.007846 0.042143 0.014515 0.002453 -0.037848 -0.034055 0.009932 -0.033229 -0.026363
DistanceFromHome -0.001686 -0.004985 1.000000 0.021042 0.016464 -0.016075 0.031131 0.008783 0.005303 -0.003669 -0.017014 0.027473 -0.029251 0.040235 0.027110 0.006557 0.044872 0.004628 -0.036942 -0.026556 0.009508 0.018845 0.010029 0.014406
Education 0.208034 -0.016806 0.021042 1.000000 0.020950 -0.027128 0.016775 0.042438 0.101589 -0.011296 0.094961 -0.026084 0.126317 -0.011111 -0.024539 -0.009118 0.018422 0.148280 -0.025100 0.009819 0.069114 0.060236 0.054254 0.069065
EmployeeNumber -0.005175 -0.025742 0.016464 0.020950 1.000000 0.008712 0.017377 -0.003552 -0.009020 -0.022970 -0.007188 0.006177 -0.000345 -0.006685 -0.010338 -0.034827 0.031226 -0.007047 0.011953 0.005370 -0.005779 -0.004427 -0.004575 -0.004716
EnvironmentSatisfaction 0.010146 0.018355 -0.016075 -0.027128 0.008712 1.000000 -0.049857 -0.008278 0.001212 -0.006784 -0.006259 0.037600 0.012594 -0.031701 -0.029548 0.007665 0.003432 -0.002693 -0.019359 0.027627 0.001458 0.018007 0.016194 -0.004999
HourlyRate 0.024287 0.023381 0.031131 0.016775 0.017377 -0.049857 1.000000 0.042861 -0.027853 -0.071335 -0.015794 -0.015297 0.022157 -0.009062 -0.002172 0.001330 0.050263 -0.002334 -0.008548 -0.004607 -0.019582 -0.024106 -0.026716 -0.020123
JobInvolvement 0.029820 0.046135 0.008783 0.042438 -0.003552 -0.008278 0.042861 1.000000 -0.012630 -0.021476 -0.015271 -0.016322 0.015012 -0.017205 -0.029071 0.034297 0.021523 -0.005533 -0.015338 -0.014617 -0.021355 0.008717 -0.024184 0.025976
JobLevel 0.509604 0.002966 0.005303 0.101589 -0.009020 0.001212 -0.027853 -0.012630 1.000000 -0.001944 0.950300 0.039563 0.142501 -0.034730 -0.021222 0.021642 0.013984 0.782208 -0.018191 0.037818 0.534739 0.389447 0.353885 0.375281
JobSatisfaction -0.004892 0.030571 -0.003669 -0.011296 -0.022970 -0.006784 -0.071335 -0.021476 -0.001944 1.000000 -0.007157 0.000644 -0.055699 0.020002 0.002297 -0.012454 0.010690 -0.020185 -0.005779 -0.019459 -0.003803 -0.002305 -0.018214 -0.027656
MonthlyIncome 0.497855 0.007707 -0.017014 0.094961 -0.007188 -0.006259 -0.015794 -0.015271 0.950300 -0.007157 1.000000 0.034814 0.149515 -0.027269 -0.017120 0.025873 0.005408 0.772893 -0.021736 0.030683 0.514285 0.363818 0.344978 0.344079
MonthlyRate 0.028051 -0.032182 0.027473 -0.026084 0.006177 0.037600 -0.015297 -0.016322 0.039563 0.000644 0.034814 1.000000 0.017521 -0.006429 -0.009811 -0.004085 -0.034323 0.026442 0.001467 0.007963 -0.023655 -0.012815 0.001567 -0.036746
NumCompaniesWorked 0.299635 0.038153 -0.029251 0.126317 -0.000345 0.012594 0.022157 0.015012 0.142501 -0.055699 0.149515 0.017521 1.000000 -0.010238 -0.014095 0.052733 0.030075 0.237639 -0.066054 -0.008366 -0.118421 -0.090754 -0.036814 -0.110319
PercentSalaryHike 0.003634 0.022704 0.040235 -0.011111 -0.006685 -0.031701 -0.009062 -0.017205 -0.034730 0.020002 -0.027269 -0.006429 -0.010238 1.000000 0.773550 -0.040490 0.007528 -0.020608 -0.005221 -0.003280 -0.035991 -0.001520 -0.022154 -0.011985
PerformanceRating 0.001904 0.000473 0.027110 -0.024539 -0.010338 -0.029548 -0.002172 -0.029071 -0.021222 0.002297 -0.017120 -0.009811 -0.014095 0.773550 1.000000 -0.031351 0.003506 0.006744 -0.015579 0.002572 0.003435 0.034986 0.017896 0.022827
RelationshipSatisfaction 0.053535 0.007846 0.006557 -0.009118 -0.034827 0.007665 0.001330 0.034297 0.021642 -0.012454 0.025873 -0.004085 0.052733 -0.040490 -0.031351 1.000000 -0.045952 0.024054 0.002497 0.019604 0.019367 -0.015123 0.033493 -0.000867
StockOptionLevel 0.037510 0.042143 0.044872 0.018422 0.031226 0.003432 0.050263 0.021523 0.013984 0.010690 0.005408 -0.034323 0.030075 0.007528 0.003506 -0.045952 1.000000 0.010136 0.011274 0.004129 0.015058 0.050818 0.014352 0.024698
TotalWorkingYears 0.680381 0.014515 0.004628 0.148280 -0.007047 -0.002693 -0.002334 -0.005533 0.782208 -0.020185 0.772893 0.026442 0.237639 -0.020608 0.006744 0.024054 0.010136 1.000000 -0.035662 0.001008 0.628133 0.460365 0.404858 0.459188
TrainingTimesLastYear -0.019621 0.002453 -0.036942 -0.025100 0.011953 -0.019359 -0.008548 -0.015338 -0.018191 -0.005779 -0.021736 0.001467 -0.066054 -0.005221 -0.015579 0.002497 0.011274 -0.035662 1.000000 0.028072 0.003569 -0.005738 -0.002067 -0.004096
WorkLifeBalance -0.021490 -0.037848 -0.026556 0.009819 0.005370 0.027627 -0.004607 -0.014617 0.037818 -0.019459 0.030683 0.007963 -0.008366 -0.003280 0.002572 0.019604 0.004129 0.001008 0.028072 1.000000 0.012089 0.049856 0.008941 0.002759
YearsAtCompany 0.311309 -0.034055 0.009508 0.069114 -0.005779 0.001458 -0.019582 -0.021355 0.534739 -0.003803 0.514285 -0.023655 -0.118421 -0.035991 0.003435 0.019367 0.015058 0.628133 0.003569 0.012089 1.000000 0.758754 0.618409 0.769212
YearsInCurrentRole 0.212901 0.009932 0.018845 0.060236 -0.004427 0.018007 -0.024106 0.008717 0.389447 -0.002305 0.363818 -0.012815 -0.090754 -0.001520 0.034986 -0.015123 0.050818 0.460365 -0.005738 0.049856 0.758754 1.000000 0.548056 0.714365
YearsSinceLastPromotion 0.216513 -0.033229 0.010029 0.054254 -0.004575 0.016194 -0.026716 -0.024184 0.353885 -0.018214 0.344978 0.001567 -0.036814 -0.022154 0.017896 0.033493 0.014352 0.404858 -0.002067 0.008941 0.618409 0.548056 1.000000 0.510224
YearsWithCurrManager 0.202089 -0.026363 0.014406 0.069065 -0.004716 -0.004999 -0.020123 0.025976 0.375281 -0.027656 0.344079 -0.036746 -0.110319 -0.011985 0.022827 -0.000867 0.024698 0.459188 -0.004096 0.002759 0.769212 0.714365 0.510224 1.000000
In [150]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize = (20, 10))
# annot=True writes the correlation value inside each cell
sns.heatmap(df_HR_num.corr(), cmap = sns.color_palette('rocket', as_cmap=True), annot=True)
plt.show()
In [187]:
plt.figure(figsize=(20,10))
sns.heatmap(df_HR_num.corr(),annot=True,cmap='rocket')
plt.show()

Above we have taken the correlation of each column with every other column (-1 means a perfect inverse correlation, 0 means no linear correlation, and 1 means a perfect positive correlation), so the data lies on a scale from -1 to 1. With the 'rocket' colormap used here, lighter cells mark higher (more positive) correlations and darker cells mark lower ones.
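
For reference, each cell of the matrix returned by corr() is, by default, a Pearson correlation coefficient. The sketch below computes that coefficient by hand for two made-up column pairs; the data and the helper name pearson are hypothetical and only illustrate the two extremes of the scale.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical columns, for illustration only
years = [1, 2, 3, 4, 5]
income_up = [10, 20, 30, 40, 50]      # rises with years
income_down = [50, 40, 30, 20, 10]    # falls with years

print(pearson(years, income_up), pearson(years, income_down))
```

Real columns rarely reach either extreme; in the matrix above, for example, JobLevel and MonthlyIncome come close at 0.95.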

Boxen Plot using Seaborn¶

Another plot that we can use to show a distribution is the boxen plot. Boxen plots were originally named letter-value plots because they show a large number of quantiles of a variable, which are also known as letter values. By plotting more quantiles than a box plot does, a boxen plot gives more insight into the shape of the distribution, especially in the tails. In other respects it is similar to a box plot.

We can draw these plots using catplot() with kind = 'boxen' or directly by calling boxenplot().

In [151]:
sns.catplot(data = df_HR, x = 'DailyRate', kind = 'boxen')
Out[151]:
<seaborn.axisgrid.FacetGrid at 0x1cfdf5b40b0>
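
To give a feel for letter values, the sketch below computes the first three levels for a small made-up sample: the median, then the quartiles ("fourths"), then the "eighths". Each level halves the tail fraction of the previous one. The data and helper names are hypothetical, and this is a simplification: boxenplot() chooses the number of levels automatically and its exact quantile rule may differ.

```python
def quantile(sorted_values, p):
    """Quantile with linear interpolation between neighbouring data points."""
    pos = (len(sorted_values) - 1) * p
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def letter_values(values, depth=3):
    """Return (lower, upper) quantile pairs for tail fractions 1/2, 1/4, 1/8, ..."""
    xs = sorted(values)
    levels = []
    for k in range(1, depth + 1):
        tail = 0.5 ** k
        levels.append((quantile(xs, tail), quantile(xs, 1 - tail)))
    return levels

# Hypothetical daily rates, for illustration only
rates = list(range(1, 18))

print(letter_values(rates))
```

Each pair corresponds to one box in a boxen plot: the widest box spans the quartiles and every further box reaches deeper into the tails.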

Visualizing Pairwise Relationships in a Dataset¶

We can also plot multiple bivariate distributions in a dataset by using the pairplot() function of the seaborn library. This shows the relationship between each pair of numeric columns of the DataFrame. It also draws the univariate distribution of each variable on the diagonal. Let's see how it looks.

In [152]:
df_HR_num.columns
Out[152]:
Index(['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeNumber',
       'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement', 'JobLevel',
       'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager'],
      dtype='object')
In [153]:
df_HR_num2 = df_HR_num[['Age', 'HourlyRate', 'JobSatisfaction', 'NumCompaniesWorked', 'PerformanceRating', 'StockOptionLevel',\
'TotalWorkingYears', 'TrainingTimesLastYear','WorkLifeBalance', 'YearsWithCurrManager']].copy()
In [154]:
df_HR_num2.head()
Out[154]:
Age HourlyRate JobSatisfaction NumCompaniesWorked PerformanceRating StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsWithCurrManager
0 41 94 4 8 3 0 8 0 1 5
1 49 61 2 1 4 1 10 3 3 7
2 37 92 3 6 3 0 7 3 3 0
3 33 56 3 1 3 0 8 3 3 0
4 27 40 2 9 3 1 6 3 3 2
In [155]:
sns.pairplot(df_HR_num2, height = 2, aspect = 1, kind = 'reg', diag_kind = 'hist')
plt.savefig('Pairplot.jpg')
plt.show()
In [195]:
sns.pairplot(df_HR_num)
Out[195]:
<seaborn.axisgrid.PairGrid at 0x1cfdf1e2180>

The kind parameter (the bivariate plots off the diagonal) takes the following arguments: 'scatter', 'kde', 'hist', 'reg'. The default is 'scatter'.

The diag_kind parameter sets the univariate plot drawn on the diagonal for each column and takes the following arguments: 'auto', 'hist', 'kde', None. The default is 'auto'.

In [156]:
df_HR_num2.plot(kind="box", subplots=True, layout=(7,5),figsize=(20,20))
Out[156]:
Age                         Axes(0.125,0.786098;0.133621x0.0939024)
HourlyRate               Axes(0.285345,0.786098;0.133621x0.0939024)
JobSatisfaction           Axes(0.44569,0.786098;0.133621x0.0939024)
NumCompaniesWorked       Axes(0.606034,0.786098;0.133621x0.0939024)
PerformanceRating        Axes(0.766379,0.786098;0.133621x0.0939024)
StockOptionLevel            Axes(0.125,0.673415;0.133621x0.0939024)
TotalWorkingYears        Axes(0.285345,0.673415;0.133621x0.0939024)
TrainingTimesLastYear     Axes(0.44569,0.673415;0.133621x0.0939024)
WorkLifeBalance          Axes(0.606034,0.673415;0.133621x0.0939024)
YearsWithCurrManager     Axes(0.766379,0.673415;0.133621x0.0939024)
dtype: object
In [157]:
print(dir(pd))
['ArrowDtype', 'BooleanDtype', 'Categorical', 'CategoricalDtype', 'CategoricalIndex', 'DataFrame', 'DateOffset', 'DatetimeIndex', 'DatetimeTZDtype', 'ExcelFile', 'ExcelWriter', 'Flags', 'Float32Dtype', 'Float64Dtype', 'Grouper', 'HDFStore', 'Index', 'IndexSlice', 'Int16Dtype', 'Int32Dtype', 'Int64Dtype', 'Int8Dtype', 'Interval', 'IntervalDtype', 'IntervalIndex', 'MultiIndex', 'NA', 'NaT', 'NamedAgg', 'Period', 'PeriodDtype', 'PeriodIndex', 'RangeIndex', 'Series', 'SparseDtype', 'StringDtype', 'Timedelta', 'TimedeltaIndex', 'Timestamp', 'UInt16Dtype', 'UInt32Dtype', 'UInt64Dtype', 'UInt8Dtype', '__all__', '__builtins__', '__cached__', '__doc__', '__docformat__', '__file__', '__git_version__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_built_with_meson', '_config', '_is_numpy_dev', '_libs', '_pandas_datetime_CAPI', '_pandas_parser_CAPI', '_testing', '_typing', '_version_meson', 'annotations', 'api', 'array', 'arrays', 'bdate_range', 'compat', 'concat', 'core', 'crosstab', 'cut', 'date_range', 'describe_option', 'errors', 'eval', 'factorize', 'from_dummies', 'get_dummies', 'get_option', 'infer_freq', 'interval_range', 'io', 'isna', 'isnull', 'json_normalize', 'lreshape', 'melt', 'merge', 'merge_asof', 'merge_ordered', 'notna', 'notnull', 'offsets', 'option_context', 'options', 'pandas', 'period_range', 'pivot', 'pivot_table', 'plotting', 'qcut', 'read_clipboard', 'read_csv', 'read_excel', 'read_feather', 'read_fwf', 'read_gbq', 'read_hdf', 'read_html', 'read_json', 'read_orc', 'read_parquet', 'read_pickle', 'read_sas', 'read_spss', 'read_sql', 'read_sql_query', 'read_sql_table', 'read_stata', 'read_table', 'read_xml', 'reset_option', 'set_eng_float_format', 'set_option', 'show_versions', 'test', 'testing', 'timedelta_range', 'to_datetime', 'to_numeric', 'to_pickle', 'to_timedelta', 'tseries', 'unique', 'util', 'value_counts', 'wide_to_long']

Finally, the best plot of them ALL.¶

In [194]:
import seaborn as sns
sns.dogplot()

The creators of seaborn put in an Easter egg: call sns.dogplot() and seaborn will display a randomly chosen picture of an adorable dog!

In [159]:
print(dir(sns))
['FacetGrid', 'JointGrid', 'PairGrid', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_base', '_compat', '_core', '_docstrings', '_orig_rc_params', '_statistics', '_stats', 'algorithms', 'axes_style', 'axisgrid', 'barplot', 'blend_palette', 'boxenplot', 'boxplot', 'categorical', 'catplot', 'choose_colorbrewer_palette', 'choose_cubehelix_palette', 'choose_dark_palette', 'choose_diverging_palette', 'choose_light_palette', 'clustermap', 'cm', 'color_palette', 'colors', 'countplot', 'crayon_palette', 'crayons', 'cubehelix_palette', 'dark_palette', 'desaturate', 'despine', 'displot', 'distplot', 'distributions', 'diverging_palette', 'dogplot', 'ecdfplot', 'external', 'get_data_home', 'get_dataset_names', 'heatmap', 'histplot', 'hls_palette', 'husl_palette', 'jointplot', 'kdeplot', 'light_palette', 'lineplot', 'lmplot', 'load_dataset', 'matrix', 'miscplot', 'move_legend', 'mpl', 'mpl_palette', 'pairplot', 'palettes', 'palplot', 'plotting_context', 'pointplot', 'rcmod', 'regplot', 'regression', 'relational', 'relplot', 'reset_defaults', 'reset_orig', 'residplot', 'rugplot', 'saturate', 'scatterplot', 'set', 'set_color_codes', 'set_context', 'set_hls_values', 'set_palette', 'set_style', 'set_theme', 'stripplot', 'swarmplot', 'utils', 'violinplot', 'widgets', 'xkcd_palette', 'xkcd_rgb']
In [160]:
print(dir(df_HR))
['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department', 'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount', 'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel', 'T', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__arrow_c_stream__', '__bool__', '__class__', '__contains__', '__copy__', '__dataframe__', '__dataframe_consortium_standard__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pandas_priority__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__weakref__', '__xor__', '_accessors', '_accum_func', '_agg_examples_doc', '_agg_see_also_doc', '_align_for_op', 
'_align_frame', '_align_series', '_append', '_arith_method', '_arith_method_with_reindex', '_as_manager', '_attrs', '_box_col_values', '_can_fast_transpose', '_check_inplace_and_allows_duplicate_labels', '_check_is_chained_assignment_possible', '_check_label_or_level_ambiguity', '_check_setitem_copy', '_clear_item_cache', '_clip_with_one_bound', '_clip_with_scalar', '_cmp_method', '_combine_frame', '_consolidate', '_consolidate_inplace', '_construct_axes_dict', '_construct_result', '_constructor', '_constructor_from_mgr', '_constructor_sliced', '_constructor_sliced_from_mgr', '_create_data_for_split_and_tight_to_dict', '_data', '_deprecate_downcast', '_dir_additions', '_dir_deletions', '_dispatch_frame_op', '_drop_axis', '_drop_labels_or_levels', '_ensure_valid_index', '_find_valid_index', '_flags', '_flex_arith_method', '_flex_cmp_method', '_from_arrays', '_from_mgr', '_get_agg_axis', '_get_axis', '_get_axis_name', '_get_axis_number', '_get_axis_resolvers', '_get_block_manager_axis', '_get_bool_data', '_get_cleaned_column_resolvers', '_get_column_array', '_get_index_resolvers', '_get_item_cache', '_get_label_or_level_values', '_get_numeric_data', '_get_value', '_get_values_for_csv', '_getitem_bool_array', '_getitem_multilevel', '_getitem_nocopy', '_getitem_slice', '_gotitem', '_hidden_attrs', '_indexed_same', '_info_axis', '_info_axis_name', '_info_axis_number', '_info_repr', '_init_mgr', '_inplace_method', '_internal_names', '_internal_names_set', '_is_copy', '_is_homogeneous_type', '_is_label_or_level_reference', '_is_label_reference', '_is_level_reference', '_is_mixed_type', '_is_view', '_is_view_after_cow_rules', '_iset_item', '_iset_item_mgr', '_iset_not_inplace', '_item_cache', '_iter_column_arrays', '_ixs', '_logical_func', '_logical_method', '_maybe_align_series_as_frame', '_maybe_cache_changed', '_maybe_update_cacher', '_metadata', '_mgr', '_min_count_stat_function', '_needs_reindex_multi', '_pad_or_backfill', '_protect_consolidate', '_reduce', 
'_reduce_axis1', '_reindex_axes', '_reindex_multi', '_reindex_with_indexers', '_rename', '_replace_columnwise', '_repr_data_resource_', '_repr_fits_horizontal_', '_repr_fits_vertical_', '_repr_html_', '_repr_latex_', '_reset_cache', '_reset_cacher', '_sanitize_column', '_series', '_set_axis', '_set_axis_name', '_set_axis_nocheck', '_set_is_copy', '_set_item', '_set_item_frame_value', '_set_item_mgr', '_set_value', '_setitem_array', '_setitem_frame', '_setitem_slice', '_shift_with_freq', '_should_reindex_frame_op', '_slice', '_stat_function', '_stat_function_ddof', '_take_with_is_copy', '_to_dict_of_blocks', '_to_latex_via_styler', '_typ', '_update_inplace', '_validate_dtype', '_values', '_where', 'abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'apply', 'applymap', 'asfreq', 'asof', 'assign', 'astype', 'at', 'at_time', 'attrs', 'axes', 'backfill', 'between_time', 'bfill', 'bool', 'boxplot', 'clip', 'columns', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'copy', 'corr', 'corrwith', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'div', 'divide', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'dtypes', 'duplicated', 'empty', 'eq', 'equals', 'eval', 'ewm', 'expanding', 'explode', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'flags', 'floordiv', 'from_dict', 'from_records', 'ge', 'get', 'groupby', 'gt', 'head', 'hist', 'iat', 'idxmax', 'idxmin', 'iloc', 'index', 'infer_objects', 'info', 'insert', 'interpolate', 'isetitem', 'isin', 'isna', 'isnull', 'items', 'iterrows', 'itertuples', 'join', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'le', 'loc', 'lt', 'map', 'mask', 'max', 'mean', 'median', 'melt', 'memory_usage', 'merge', 'min', 'mod', 'mode', 'mul', 'multiply', 'ndim', 'ne', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pad', 'pct_change', 'pipe', 'pivot', 'pivot_table', 'plot', 'pop', 'pow', 'prod', 'product', 'quantile', 'query', 'radd', 
'rank', 'rdiv', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'select_dtypes', 'sem', 'set_axis', 'set_flags', 'set_index', 'shape', 'shift', 'size', 'skew', 'sort_index', 'sort_values', 'squeeze', 'stack', 'std', 'style', 'sub', 'subtract', 'sum', 'swapaxes', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dict', 'to_excel', 'to_feather', 'to_gbq', 'to_hdf', 'to_html', 'to_json', 'to_latex', 'to_markdown', 'to_numpy', 'to_orc', 'to_parquet', 'to_period', 'to_pickle', 'to_records', 'to_sql', 'to_stata', 'to_string', 'to_timestamp', 'to_xarray', 'to_xml', 'transform', 'transpose', 'truediv', 'truncate', 'tz_convert', 'tz_localize', 'unstack', 'update', 'value_counts', 'values', 'var', 'where', 'xs']
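The listing above mixes three kinds of names: the dataset's column names (such as 'Age' and 'Attrition'), which pandas exposes for attribute-style access, private and dunder helpers that start with an underscore, and the public DataFrame methods and properties. A minimal sketch of how the listing could be split into those groups is shown below. The small `df` here is a hypothetical stand-in that reproduces only three of the columns with made-up values, and the variable names `column_attrs` and `public_attrs` are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's DataFrame: three of its columns, made-up values.
df = pd.DataFrame({
    "Age": [30, 45],
    "Attrition": ["Yes", "No"],
    "MonthlyIncome": [4000, 6500],
})

names = dir(df)

# Column names that are valid identifiers appear in dir() because
# pandas allows attribute-style access such as df.Age.
column_attrs = [n for n in names if n in df.columns]

# Public API: no leading underscore, and not one of the column names.
public_attrs = [n for n in names if not n.startswith("_") and n not in df.columns]

print(column_attrs)
print(len(public_attrs), "public methods and properties")
```

Filtering this way makes the output much easier to scan. If only the column names are needed, `df.columns` gives them directly without going through `dir()`.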